Ranking and Clustering in Probabilistic Databases

Authors

  • Jian Li
  • Barna Saha
  • Amol Deshpande
Abstract

The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we address the problem of on-the-fly clustering and ranking over probabilistic databases. We begin with a systematic exploration of ranking in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRF^ω and PRF^e, that can approximate many of the previously proposed ranking functions. We present several novel algorithms for efficiently computing such ranking functions using generating functions, even over databases that exhibit complex correlation patterns modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and develop an approach to learn such parameters. We also develop a hierarchical framework for efficiently combining on-the-fly clustering and ranking (called a ClusterRank query) over probabilistic databases. Our framework is based on a general definition of clustering, called restricted soft-t clustering, where a tuple is allowed to participate in at most t clusters. We show how several of our ranking functions can be seamlessly integrated into this framework, which not only allows ranking to continue in parallel with clustering, but also enables pruning of a large portion of the search space. Finally, we present a comprehensive experimental study comparing different ranking functions, and illustrating the effectiveness of our clustering framework.
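
The generating-function idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the definitions, so everything below is an assumption made for illustration: tuples are taken to be independent, each with a score and an existence probability, scores are distinct, and a tuple's value is a weighted sum over its rank distribution, with a user-supplied `weight` function. The names `rank_distribution` and `parameterized_rank_score` are hypothetical, and the sketch does not cover the correlated (and/xor tree or Markov network) cases.

```python
from typing import Callable, List, Tuple

# Each tuple is (score, existence_probability); independence is ASSUMED here.
ProbTuple = Tuple[float, float]


def rank_distribution(tuples: List[ProbTuple], index: int) -> List[float]:
    """Return dist where dist[j] = Pr(tuple `index` exists and has rank j + 1).

    Expands the generating function prod(1 - p' + p' * x) over all tuples
    with a strictly higher score; the coefficient of x^k is the probability
    that exactly k of those tuples exist.
    """
    score, prob = tuples[index]
    coeffs = [1.0]  # the constant polynomial 1
    for j, (s, p) in enumerate(tuples):
        if j == index or s <= score:
            continue  # only higher-scored tuples can push this one down
        nxt = [0.0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            nxt[k] += c * (1.0 - p)   # higher-scored tuple absent
            nxt[k + 1] += c * p       # higher-scored tuple present
        coeffs = nxt
    # Multiply by the probability that the tuple itself exists.
    return [prob * c for c in coeffs]


def parameterized_rank_score(
    tuples: List[ProbTuple], index: int, weight: Callable[[int], float]
) -> float:
    """Weighted sum of the rank distribution (hypothetical weight function)."""
    dist = rank_distribution(tuples, index)
    return sum(weight(rank) * pr for rank, pr in enumerate(dist, start=1))


if __name__ == "__main__":
    data = [(10.0, 0.5), (8.0, 0.4), (5.0, 1.0)]
    # A step weight that counts only ranks 1 and 2 yields the probability
    # of appearing in the top 2.
    top2 = lambda rank: 1.0 if rank <= 2 else 0.0
    for i in range(len(data)):
        print(i, parameterized_rank_score(data, i, top2))
```

Under these assumptions, different choices of `weight` recover different ranking behaviours from the same rank distribution, which is one way to read the abstract's claim that a parameterized function can approximate several earlier ranking functions.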

Related resources

Application of Probabilistic Clustering Algorithms to Determine Mineralization Areas in Regional-Scale Exploration Studies

In this work, we aim to identify the mineralization areas for the next exploration phases. To this end, probabilistic clustering algorithms are used to determine the multi-element geochemical anomalies, owing to their use of appropriate measures, their ability to work with datasets with missing values, and their resistance to becoming trapped in local optima. Four probabilistic clustering algorithms, namely PHC,...

ProUD: Probabilistic Ranking in Uncertain Databases

There are a lot of application domains, e.g. sensor databases, traffic management or recognition systems, where objects have to be compared based on vague and uncertain data. Feature databases with uncertain data require special methods for effective similarity search. In this paper, we propose an effective and efficient probabilistic similarity ranking algorithm that exploits the full informat...

Clustering and Ranking University Majors using Data Mining and AHP algorithms: The case of Iran

Abstract: Although all university majors are important and the need for them is beyond question, they may not all warrant the same priority given the different resources and strategies available to a country. This paper focuses on clustering and ranking university majors in Iran. To do so, a model is presented to clarify the procedure. Eight different criteria are ...

Ranking For Web Databases Using SVM and K-Means algorithm

Internet usage is now widespread, and it has become necessary for people to use applications such as searching web databases in domains like animation, vehicles, movies, real estate, etc. One of the problems in this context is ranking the results of a user query. Earlier approaches to this problem have used frequencies of database values, query logs, and user...

Top-k best probability queries and semantics ranking properties on probabilistic databases

There has been much interest in answering top-k queries on probabilistic data in various applications such as market analysis, personalised services, and decision making. In probabilistic relational databases, the most common problem in answering top-k queries (ranking queries) is selecting the top-k result based on scores and top-k probabilities. In this paper, we firstly propose novel answers...

Publication date: 2008